An EDA of Human Rights Variables in the V-Dem Dataset
Data Science 1 with R (STAT 301-1)
1 Introduction
1.1 Data sources
For this report, I aim to merge and analyze two datasets: Varieties of Democracy (University of Gothenburg) and Human Rights Scores (Christopher Fariss, University of Michigan). The former is readily accessible as the R package vdemdata,1 whereas the latter can be found on Fariss’s Dataverse page. The two datasets are in a country-year panel format and can be easily merged on the basis of year and country codes.2
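As a minimal sketch of such a merge (not the report's actual script), the base R call below joins two toy country-year panels on a COW code and year. The data values and the key name `cow_code` on the HR side are assumptions made for illustration; only the idea of matching on country code and year comes from the text.

```r
# Toy stand-ins for the two country-year panels (all values are made up).
vdem_toy <- data.frame(
  COWcode   = c(2, 2, 20, 20),
  year      = c(2018, 2019, 2018, 2019),
  v2x_clphy = c(0.90, 0.91, 0.95, 0.96)
)
hr_toy <- data.frame(
  cow_code = c(2, 2, 20),          # hypothetical key name on the HR side
  year     = c(2018, 2019, 2018),
  hr_score = c(0.5, 0.6, 2.1)
)

# Left join: keep every V-Dem country-year and attach HR scores where a
# matching country code and year exist.
merged <- merge(
  vdem_toy, hr_toy,
  by.x  = c("COWcode", "year"),
  by.y  = c("cow_code", "year"),
  all.x = TRUE
)

# Country-years with no HR counterpart carry NA in hr_score.
merged
```

Using `all.x = TRUE` keeps unmatched V-Dem rows, which is what makes the later gaps in `hr_score` visible rather than silently dropping them.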
1.2 Why this data
To be very brief, much of my dissertation relies on the use of a dependent variable measuring human rights respect. Hitherto, I have depended almost exclusively on Fariss’s Human Rights (HR) Scores, but he has not updated his dataset since 2020. Therefore, I would like to see if there are any variables in the V-Dem dataset—widely regarded as one of the most rigorous and comprehensive datasets available to political scientists—that co-vary strongly with Fariss’s variable, enabling me to account for more recent country-year observations, or to help me impute the missing data in HR scores.
1.3 Overview of report
My report proceeds as follows. First, I demonstrate (briefly) that my data are sufficiently high-quality and complex (Section 2). Subsequently, I conduct my EDA (Section 3), focusing on the covariance between HR scores and:
- Physical Violence Index (Section 3.1)
- Physical Violence Index Ordinal (Section 3.2)
- Physical Violence Index’s constitutive variables (Section 3.3):
  - Freedom from Political Killings
  - Freedom from Torture
- Physical Violence Index Raw (Section 3.4)
I conclude by discussing my findings and potential next steps for my research (Section 4).
My report also contains an extensive appendix (Section 5), wherein I give analyses of additional HR-related variables. I include these to buttress my EDA, as I explain where appropriate.
2 Data quality & complexity check
As noted above, the datasets are easy to access, load, and merge. Merging yields an exceedingly large and complex dataset, with 4603 variables (4573 numeric vs. 30 categorical) and 28353 observations. This does not mean, however, that I cannot narrow my scope of analysis. Because my research deals exclusively with post-World War II developments (namely, preferential trade agreements, bilateral investment treaties, and social media use by governments), I can safely dismiss all prewar observations, reducing the number of rows to 13520. Furthermore, because the dataset covers an extraordinary miscellany of metrics assessing quality of governance, I can focus my attention specifically on those that directly or indirectly purport to capture levels of human rights respect. Accordingly, I locate such variables by conducting a simple search through the V-Dem codebook.3
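The subsetting and the numeric-versus-categorical tally described above can be sketched in base R as follows. The toy panel is invented, and the 1946 cutoff is an assumption, since the text does not state the exact year used to define "postwar."

```r
# Toy panel; values are illustrative only.
panel <- data.frame(
  country_name = c("A", "A", "A", "B", "B"),
  year         = c(1930, 1950, 2000, 1940, 1990),
  v2x_clphy    = c(0.2, 0.4, 0.8, 0.1, 0.7),
  hr_score     = c(NA, -0.5, 1.2, NA, 0.3),
  stringsAsFactors = FALSE
)

# Keep postwar country-years only (cutoff year is an assumption).
postwar <- panel[panel$year >= 1946, ]

# Tally numeric versus non-numeric (categorical) columns.
is_num        <- vapply(postwar, is.numeric, logical(1))
n_numeric     <- sum(is_num)
n_categorical <- sum(!is_num)
```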
Importantly, the “postwar” version of my dataset remains affected by missingness, with 99.96% of all variables and 100% of all observations featuring missingness, and 30.48% of all values being missing. As noted in Section 1.2, this includes all values for hr_score where year >= 2020. Unfortunately, this widespread missingness contributed to my inability to use machine learning methods to locate additional variables associated with HR scores in the context of this assignment (see Section 4).
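The three missingness figures quoted above (share of variables, share of observations, and share of individual values) can each be computed from a single logical mask. The following is a sketch on an invented four-by-four data frame, not the report's own code.

```r
# Toy data frame with scattered NAs (values are made up).
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, 2, 3, 4),
  c = c(1, 2, 3, 4),
  d = c(NA, NA, NA, 1)
)

# TRUE wherever a value is missing.
na_mask <- is.na(toy)

# Share of variables (columns) with at least one missing value.
share_vars_with_na <- mean(colSums(na_mask) > 0)

# Share of observations (rows) with at least one missing value.
share_rows_with_na <- mean(rowSums(na_mask) > 0)

# Share of all individual values that are missing.
share_values_na <- mean(na_mask)
```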
3 Explorations
Owing to the aforementioned search through the V-Dem codebook,4 I find the physical violence index, which aims to capture the “extent [to which] physical integrity [is] respected,”5 to be the most similar to HR scores conceptually. My EDA therefore focuses on the variants of said index and its inputs.
3.1 Physical Violence Index
The main version of the physical violence index (v2x_clphy) is an average of the “freedom from political killings” and “freedom from torture” indicators.6 Like HR scores, higher values are associated with greater respect for physical integrity. However, whereas HR scores is scaled to possess a mean of 0, the physical violence index is fit to a 0-to-1 scale.
The total number of missing data points is 798; this represents about 5.9% of all data points for the variable. This overall rate of missingness is relatively low, but it is nonetheless higher for some countries than for others:
As we can see, the top of the list shown in Table 1, above, is dominated by small island nations and European microstates. For these cases, then, the physical violence index seems to be a suboptimal substitute for HR scores.7
Nevertheless, the list is topped by the Czech Republic (formerly part of Czechoslovakia), which didn’t exist in its current form until 1993, hence the missing datapoints. Therefore, it is perhaps justifiable to rectify this missingness by simply substituting Czechoslovakia’s data for the Czech Republic’s missing data.8
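A per-country missingness ranking of the kind summarized in Table 1 can be sketched as below. The country labels and values are hypothetical, and this is only one base R way to produce such a ranking.

```r
# Toy country-year panel (labels and values are hypothetical).
toy <- data.frame(
  country_name = c("X", "X", "X", "X", "Y", "Y", "Y", "Y"),
  year         = rep(2000:2003, times = 2),
  v2x_clphy    = c(NA, NA, NA, 0.8, 0.5, NA, 0.6, 0.7)
)

# Overall count and percentage of missing values for the variable.
n_missing   <- sum(is.na(toy$v2x_clphy))
pct_missing <- 100 * mean(is.na(toy$v2x_clphy))

# Share of missing country-years within each country, highest first.
miss_rate <- vapply(
  split(is.na(toy$v2x_clphy), toy$country_name),
  mean,
  numeric(1)
)
miss_rate <- sort(miss_rate, decreasing = TRUE)
miss_rate
```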
Below is Figure 1, a scatterplot displaying the covariance between the physical violence index and HR scores:9
What we see is unsurprising: the relationship between the two variables is positive. However, it is unclear whether this relationship can be deemed “linear,” for it appears as though the two metrics might co-vary most strongly at the ends of the human-rights-respect spectrum, where the datapoints seem to “cluster.” Therefore, while the physical violence index may not be a poor substitute/imputation input for HR scores, it may also be less convenient vis-à-vis alternatives. As such, I proceed in part by showing how a simple average between “freedom from political killings” and “freedom from torture” might be better suited for our needs (see Section 3.3 & Section 3.4).
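One hedged way to put a number on the linearity question is to compare Pearson's r, which measures linear association, with Spearman's rho, which measures any monotone association. A gap between the two is consistent with a positive but curved relationship. This diagnostic is a suggestion layered on the visual reading above, not something the report's scripts are known to do, and the data below are invented.

```r
# Toy data: monotone increasing, but steeper in some stretches than others.
toy <- data.frame(
  v2x_clphy = c(0.05, 0.10, 0.30, 0.50, 0.70, 0.90, 0.95),
  hr_score  = c(-2.0, -1.8, -0.6, 0.0, 0.5, 1.9, 2.4)
)

# Line of best fit and its slope.
fit   <- lm(hr_score ~ v2x_clphy, data = toy)
slope <- unname(coef(fit)["v2x_clphy"])

# Linear versus rank-based association.
pearson  <- cor(toy$v2x_clphy, toy$hr_score, method = "pearson")
spearman <- cor(toy$v2x_clphy, toy$hr_score, method = "spearman")

c(slope = slope, pearson = pearson, spearman = spearman)
```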
(Helpful to note is that two related variables, the “equality before the law and individual liberty” and the “civil liberty” indices, demonstrate similar patterns, if in even more pronounced ways. For this reason, I have deemed them even less suitable than the physical violence index, relegating my analyses thereof to the appendix. For more, see Section 5.1 & Section 5.3.)
3.2 Physical Violence Index Ordinal
A cognate variable featured in the V-Dem dataset is an “ordinalized [version]” of the physical violence index (e_v2x_clphy).10 This variable, the “physical violence index ordinal,” appears in three forms: three-point, four-point, and five-point versions (*_3C, *_4C, *_5C).11
To appraise its suitability as a substitute/imputation input for HR scores, I check whether its relationship to HR scores is similar to that of the unordinalized physical violence index (Section 3.1). To this end, I furnish side-by-side boxplots below (Figure 2):
Above, we see that the relationship is indeed similar: irrespective of the number of points, the physical violence index ordinal is positively associated with HR scores. What’s more, for the four- and five-point plots, the quantiles appear to “jump” after the first level and at the final level. This seems to corroborate our prior observation that, with respect to the covariance between the physical violence index and HR scores, there is “clustering” of datapoints at the lower and upper ends of the human-rights-respect spectrum (see Figure 1).
Ultimately, then, the relationship between the ordinalized physical violence index and HR scores is similar to that between the unordinalized physical violence index and HR scores. Consequently, it would seem that each version is comparable as a substitute/imputation input for HR scores—not a poor choice, but perhaps not quite as optimal as others.
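The "jump" between levels that the boxplots convey can also be read off numerically by comparing medians across adjacent levels. The sketch below uses an invented four-level column named `ord_4` (a hypothetical stand-in for a four-point ordinal variable) and made-up HR scores.

```r
# Toy data: a hypothetical four-level ordinal column and HR scores.
toy <- data.frame(
  ord_4    = c(1, 1, 2, 2, 3, 3, 4, 4),
  hr_score = c(-2.1, -1.7, -0.8, -0.3, 0.1, 0.6, 1.8, 2.3)
)

# Median HR score within each ordinal level.
medians <- vapply(split(toy$hr_score, toy$ord_4), median, numeric(1))

# Step in the median from one level to the next; larger steps at the
# first and last transitions would mirror the "jump" described above.
steps <- diff(medians)

medians
steps
```

Passing the same grouping to `boxplot(hr_score ~ ord_4, data = toy)` draws the corresponding side-by-side view.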
(The “equality before the law and individual liberty” and “civil liberty” indices also feature ordinalized versions; but once again, they demonstrate similar patterns, if in even more pronounced ways. For more, see Section 5.2 & Section 5.4.)
3.3 Physical Violence Index’s constitutive variables
If the physical violence index and its variants aren’t “quite” as ideal as could be, then what might be “better”? To wit, are there any variables that exhibit a more clearly linear relationship with HR scores?
Thankfully, we needn’t look far. As noted above, the physical violence index is an average of the “freedom from political killings” (v2clkill) and “freedom from torture” (v2cltort) indicators.12 These inputs differ from their output, however, in that they are not fit to a 0-to-1 scale, being raw “latent scores” that are allowed to take on both positive and negative values.13
Below is Figure 3, which displays side-by-side scatterplots of the relationship between each of these variables and HR scores:
As we can see, the unadjusted scale seems to enable a more linear relationship between the physical-integrity-rights-respect indicators and HR scores. Therefore, it would seem that an unadjusted form of the physical violence index—that is, a raw average of the political killings and torture indicators—might give us the most “ideal” substitute/imputation input for HR scores.
3.4 Physical Violence Index Raw
Motivated by this logic, I generate a “raw” version of the physical violence index (avg_kill_tort), which, as noted above, simply averages the political killings and torture indicators. Below, then, is Figure 4, which gives the relationship between said variable and HR scores in scatterplot form:
Herein, we see a relationship that appears more linear relative to that shown in Figure 1. In my estimation, then, it would seem that this unadjusted version of the physical violence index is the best substitute/imputation input for HR scores vis-à-vis the other variables I’ve analyzed for this report.
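The construction of avg_kill_tort amounts to a row-wise mean of the two indicators. A minimal sketch on invented values follows; leaving `na.rm` at its default, so that a country-year missing either input stays missing, is an assumption about the intended behavior.

```r
# Toy latent scores for the two indicators (values are made up).
toy <- data.frame(
  v2clkill = c(-2.0, -0.5, 0.4, 1.8, NA),
  v2cltort = c(-1.6, -0.1, 0.2, 2.2, 1.0)
)

# Row-wise average; a row with either input missing yields NA.
toy$avg_kill_tort <- rowMeans(toy[, c("v2clkill", "v2cltort")])

toy
```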
4 Conclusion
In toto, my preliminary EDA suggests that while the physical violence index and its variants may not be poor substitutes/imputation inputs for HR scores, a “raw” (i.e., unadjusted) version of said index may be my best option.
In my second progress memo, I stated my intention to implement a lasso regression in order to locate additional variables associated with HR scores. Justifying my desire was a belief that searching for such variables in an “unbiased” manner (i.e., automated and unaffected by my prior beliefs as to what constitutes a “human rights” variable) would lend credibility to my findings. Unfortunately—owing to widespread missingness in the V-Dem dataset (see Section 2),14 but especially to my inability to account for country and year fixed effects—I ultimately could not implement a lasso that produced results I deemed credible. Though I still believe in the helpfulness of completing a lasso and aim to do so eventually, I cannot proceed until I become able to deal with the problem of missingness (perhaps with the aid of multiple imputation, which itself would require the aid of a supercomputer, such as Northwestern’s Quest, given the scale of the dataset) and incorporate fixed effects into my model. Clearing these hurdles will obviously require more research and skills-acquisition on my part.
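To illustrate why missingness blocks a model that requires complete inputs, the sketch below applies listwise deletion to a toy data frame in which every row has at least one missing value, mirroring the situation reported in Section 2. The column names `x1` to `x3` are hypothetical, and narrowing to a few candidate predictors first is offered only as one possible workaround.

```r
# Toy data: every row has exactly one NA somewhere.
toy <- data.frame(
  hr_score = c(0.5, NA, 1.2, -0.3),
  x1       = c(NA, 0.2, 0.4, 0.6),
  x2       = c(1, 2, NA, 4),
  x3       = c(1, 2, 3, NA)
)

# Listwise deletion across all columns leaves nothing to fit on.
all_cols      <- toy[complete.cases(toy), ]
n_rows_all    <- nrow(all_cols)

# Restricting to a small set of columns before deletion keeps some rows.
subset_cols   <- toy[, c("hr_score", "x2")]
kept          <- subset_cols[complete.cases(subset_cols), ]
n_rows_subset <- nrow(kept)

c(all_columns = n_rows_all, two_columns = n_rows_subset)
```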
I also expressed a desire in my second progress memo to find additional covariates by analyzing outliers in my plots. Although time constraints prevented me from completing this task in full, I was able to at least identify a few.15 Perhaps the most visible outlier is Mongolia, which for many year-observations received a high HR score but a below-average physical violence score.16 By contrast, there exist some year-observations, such as for Sudan and Portugal, that received high physical violence scores yet low HR scores. I do not have extensive country-level knowledge in these instances, so to understand their score discrepancies, a natural next step would be to conduct some (brief) historical research and review the sources informing each of these country-years’ scores.
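One simple way to surface candidate outliers is to rank observations by the size of their residual from the line of best fit. The sketch below does this on invented data with hypothetical country labels; it is not a reproduction of the outlier checks in the report's script.

```r
# Toy data: five points near a line, plus one (F) well above it.
toy <- data.frame(
  country_name  = c("A", "B", "C", "D", "E", "F"),
  avg_kill_tort = c(-2, -1, 0, 1, 2, -1),
  hr_score      = c(-2.1, -0.9, 0.1, 1.0, 2.1, 2.5)
)

# Fit the line of best fit and record each point's residual.
fit            <- lm(hr_score ~ avg_kill_tort, data = toy)
toy$residual   <- resid(fit)

# Rank by absolute residual, largest first.
ranked         <- toy[order(abs(toy$residual), decreasing = TRUE), ]
top_candidate  <- ranked$country_name[1]

ranked
```

A positive residual here means a higher HR score than the fitted line predicts, the same direction as the Mongolia pattern described above.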
5 Appendix: Analyses of Additional HR-Related Variables
5.1 Equality before the Law and Individual Liberty Index
The equality before the law and individual liberty index (v2xcl_rol) seeks to answer: “To what extent are laws transparent and rigorously enforced and public administration impartial, and to what extent do citizens enjoy access to justice, secure property rights, freedom from forced labor, freedom of movement, physical integrity rights, and freedom of religion?”17 This variable is therefore more capacious than the physical violence index; but it does capture physical integrity rights, being partially composed of the latter’s inputs.18
Below is Figure 5, a scatterplot illustrating the relationship between the equality before the law and individual liberty index, on the one hand, and HR scores, on the other:
The relationship we see is similar to that between the physical violence index and HR scores, being manifestly positive but not exactly linear (see Figure 1). What’s more, the spread of the data seems greater in comparison, indicating a greater degree of variance.
5.2 Equality before the Law and Individual Liberty Index Ordinal
As noted in Section 3.2, there also exists an ordinalized equality before the law and individual liberty index (e_v2xcl_rol), with three-, four-, and five-point versions (*_3C, *_4C, *_5C). Below are side-by-side boxplots for each (Figure 6):
As with the physical violence index ordinal (see Figure 2), we see a positive relationship with HR scores, although the quantiles for each level (particularly those in the “middle” of the point scale) seem to rise more incrementally in comparison, suggesting a relationship that is “less strong” in the middle of the distribution.
5.3 Civil Liberties Index
The purpose of the civil liberties index (v2x_civlib) is, naturally, to capture the “extent [to which] civil liberty [is] respected.”19 To V-Dem, this concept includes physical integrity rights, so the variable is composed in part of the physical violence index.20
Below is the scatterplot of the civil liberties index and HR scores (Figure 7):
The shape of the distribution is similar to what we’ve seen in Figure 1 (physical violence index) and Figure 5 (equality before the law and individual liberty index) in being positive but not precisely linear. In fact, of the three, the relationship between the civil liberties index and HR scores appears the least linear insofar as the right end of the spectrum “shoots” upwards the most noticeably.
5.4 Civil Liberties Index Ordinal
As the physical violence index and equality before the law and individual liberty index each have an ordinalized version, so too does the civil liberties index (e_v2x_civlib). Figure 8, below, contains boxplots for each number of points (*_3C, *_4C, *_5C):
As in Figure 2 and Figure 6, we see a positive relationship with HR scores; but relative to the latter, we see that the quantiles for each level (particularly those in the “middle” of the point scale) rise in an even more incremental fashion.
5.5 Political Violence Indicator
The political violence indicator (v2caviol) endeavors to answer the question: “How often have non-state actors used political violence against persons this year?”21 It therefore helps to capture instances of physical integrity rights disrespect; but it differs significantly from HR scores and the physical violence index in being more restrictive, limiting its coverage to the behavior of non-state actors, exclusively.
The total number of missing data points is 960; this represents about 7.1% of all data points for the variable. This overall rate of missingness is low yet comparatively higher than that of the physical violence index and its associated variables.22 In addition, some cases feature higher rates of missingness than others:
At the top of the list shown in Table 2, above, we again see an abundance of small island nations and European microstates, as well as the Czech Republic.23 The widespread missingness in the Republic of Vietnam and the Yemen People’s Republic can easily be explained by the fact that neither state has existed for quite some time. However, the appearance of Afghanistan, a country which has not only persisted into the modern era but also suffered an exceptional degree of violence, casts a measure of doubt on the variable’s reliability.
Below is a scatterplot of the relationship between the political violence indicator and HR scores (Figure 9):
Here, we see a negative relationship, because a higher score on the political violence indicator signifies a greater degree of abuse. However, compared to the physical violence index and its associated variables, we see more variance along the spectrum, particularly at the left end, where the human-rights “respecters” generally lie.
Ultimately, on account of its limited theoretical coverage, more concerning patterns of missingness, and greater variance, the political violence indicator seems (in my estimation) the least optimal substitute/imputation input for HR scores, at least of the variables that I analyzed for this report.
Footnotes
See the V-Dem Institute’s GitHub repo for installation instructions.
The country codes, specifically, are Correlates of War (COW) IDs.
I specifically do so by searching for terms relating to, inter alia, “human rights” and respect for “physical integrity.”
V-Dem Codebook, 2023, p. 297.
See ibid.
And indeed, HR scores possesses values for virtually all of these countries pre-2020.
Note that, because they constitute or derive from the physical violence index, the four variables I analyze in the body of this report (the physical violence index ordinal, the political killings and torture indicators, and the raw physical violence index) possess the same missing datapoints and hence the same rates of missingness. What’s more, save for the political violence indicator, all the variables I analyze in the appendix exhibit the selfsame missingness. For evidence, see the R scripts Key_Brian_final_report.R and Key_Brian_Progress_Memo_2.R.
Throughout, I include and analyze scatterplots, rather than heatmaps, which I produced in my second progress memo, because they enable me to better identify salient outliers. I will discuss some of these in the conclusion (Section 4).
V-Dem Codebook, 2023, p. 357.
Ibid.
See Section 3.1.
Indeed, the lasso model I used from the R package glmnet does not allow for any missing values.
See my outlier checks in the R script Key_Brian_final_report.R.
See the isolated cluster of datapoints in the left half of, and well above the line of best fit in, Figure 1 & Figure 4. Also recall from Section 3.1 that high physical violence and HR scores indicate “greater” rights respect.
V-Dem Codebook, 2023, p. 51.
As discussed in Section 3.1, these inputs are the freedom from political killings and freedom from torture indicators. See ibid.
V-Dem Codebook, 2023, p. 296.
See ibid.
V-Dem Codebook, 2023, p. 226.
As mentioned in an earlier footnote, the missingness seen in the physical violence index and the other variables analyzed in this appendix is exactly the same; the political violence indicator is thus an outlier in this respect.